Title:

Exploratory Data Analysis

Description:

Exploratory Data Analysis refers to the critical process of performing initial investigations on data so as to discover patterns, to spot anomalies, to test hypotheses and to check assumptions with the help of summary statistics and graphical representations.

In this notebook we will define the target we would like to predict, study NaN values and outliers, and drop duplicates and variables we do not want for the analysis...

Authors:

Hugo Cesar Octavio del Sueldo

Jose Lopez Galdon

Date:

04/12/2020

Version:

1.0


Libraries

We are going to start this EDA loading the respective libraries
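A minimal sketch of the kind of imports this step relies on (the exact set is an assumption: pandas/NumPy for data handling, Matplotlib for the plots, and scikit-learn for the feature-selection step later on):

```python
# Core libraries assumed for this notebook (illustrative list).
import numpy as np               # numerical arrays
import pandas as pd              # dataframes and data wrangling
import matplotlib.pyplot as plt  # plotting
from sklearn.feature_selection import RFE  # used later for feature selection
```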


Load data


Variable predict

Taking into account the goals of this exercise, we only want to know which borrowers fully paid their loans and which did not, so we will transform the dependent variable into a binary one, with ones or zeros depending on that condition.

So we are going to predict the variable loan_status, which is classified into different categories:

As we know, our problem is to predict who is going to pay or not. Because of this we will select Fully Paid, Default and Charged Off. Default and Charged Off will be treated as the same category, because they describe almost the same condition:

Loans that are in "Default" are loans for which borrowers have failed to make payments for an extended period of time.

A loan becomes “Charged Off” when there is no longer a reasonable expectation of further payments. Charge Off typically occurs when a loan is 120 days or more past due and there is no reasonable expectation of sufficient payment to prevent the charge off. In certain circumstances, loans may be charged off at an earlier or later date. Please note, loans for which borrowers have filed for bankruptcy may be charged off earlier based on the date of bankruptcy notification.

A loan that is in “Default” will still appear in your Notes, in the status of “Default,” while a loan that has been “Charged Off” will appear as charged off, and the remaining principal balance of the Note will be deducted from your account balance.

As you can see, we have a lot of current loans and a fair amount of fully paid loans. The other categories (including Default) have very low counts. This could mean that the data is imbalanced, and we might need to do something about it later in the analysis. Therefore, we will merge Charged Off and Default; this will be our non-payment class.
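The transformation described above can be sketched on a toy dataframe (the real data would be used in practice; only the loan_status labels come from the text):

```python
import pandas as pd

# Toy example: keep only the three statuses of interest and map them to a
# binary target -- Fully Paid becomes 0, Charged Off / Default become 1.
df = pd.DataFrame({"loan_status": [
    "Fully Paid", "Charged Off", "Current", "Default", "Fully Paid"]})

keep = ["Fully Paid", "Charged Off", "Default"]
df = df[df["loan_status"].isin(keep)].copy()
df["loan_status"] = (df["loan_status"] != "Fully Paid").astype(int)
```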

We will produce the same graph...

As we said before, we may have an imbalanced dataset. If we train the algorithms with this proportion, they will be biased towards the zeros (payment), so a new client would be more likely to be classified as a zero than a one, which could lead us to the wrong decision.
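A quick way to quantify the imbalance is to look at the class proportions of the target; a toy sketch:

```python
import pandas as pd

# Toy target vector: value_counts(normalize=True) returns the share of
# each class, making the imbalance explicit.
target = pd.Series([0, 0, 0, 0, 1])
proportions = target.value_counts(normalize=True)
print(proportions)
```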

View data

Firstly, we are going to take a look at our dataframe:

Data wrangling

After studying the whole dictionary, placed in references/dictionary.md, we saw that there are some variables we cannot use because they contain a posteriori information. Some examples of these columns are: collection_recovery_fee, installment, last_credit_pull_d, last_pymnt_amnt...

Once we have viewed our dataset, we continue exploring the column types and nulls...

As we can see above, we have plenty of columns with nulls, so we are going to drop those columns in which more than 75% of the values are null.
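The rule can be sketched on a toy dataframe: compute the null fraction per column and keep only the columns below the threshold.

```python
import numpy as np
import pandas as pd

# Toy example: drop every column in which MORE than 75% of the values are null.
df = pd.DataFrame({
    "mostly_null": [np.nan, np.nan, np.nan, 1.0],  # exactly 75% null -> kept
    "all_null":    [np.nan] * 4,                   # 100% null -> dropped
    "complete":    [1, 2, 3, 4],
})
df = df.loc[:, df.isnull().mean() <= 0.75]
```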

We can see that now we have only 64 columns instead of 151. This is a huge reduction in the number of variables, but we still have too many; in order to reduce them further, we will study which of them are important for predicting loan_status.

Another method we could use to select the important variables is a correlation matrix, picking those with high correlation values, but with 65 columns that would be too cumbersome... so we will use a smarter way:

Select those pairs of variables with more than 80% correlation between them and drop one variable of each pair
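A common sketch of this rule (on toy data) uses the upper triangle of the absolute correlation matrix, so that each pair is only inspected once and only one member of each correlated pair is dropped:

```python
import numpy as np
import pandas as pd

# Toy data: "b" is almost a linear copy of "a", "c" is independent noise.
rng = np.random.default_rng(0)
a = rng.normal(size=100)
df = pd.DataFrame({"a": a,
                   "b": a * 2 + rng.normal(scale=0.01, size=100),
                   "c": rng.normal(size=100)})

corr = df.corr().abs()
# Keep only the upper triangle so each pair appears once.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > 0.8).any()]
df = df.drop(columns=to_drop)
```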

Now we will drop those columns which are not useful for predicting loan_status, using our business knowledge; some examples are url, policy_code, title, purpose...

Recursive Feature Elimination

As we can see, we have reduced the dataset by 33%, but we still have too many columns, so we are going to use the RFE (Recursive Feature Elimination) algorithm, which is basically a backward selection of the predictors. This technique begins by building a model on the entire set of predictors and computing an importance score for each predictor. The least important predictor(s) are then removed, the model is re-built, and importance scores are computed again. In practice, the analyst specifies the number of predictor subsets to evaluate as well as each subset's size. Therefore, the subset size is a tuning parameter for RFE. The subset size that optimizes the performance criteria is used to select the predictors based on the importance rankings. The optimal subset is then used to train the final model.
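A minimal, self-contained sketch of RFE on synthetic data (scikit-learn's RFE with a logistic regression as the base estimator; the real notebook would pass the loan dataframe instead):

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Synthetic classification data standing in for the loan dataset.
X, y = make_classification(n_samples=200, n_features=10, n_informative=3,
                           random_state=0)

# RFE starts from all predictors and recursively removes the least
# important ones until n_features_to_select remain.
rfe = RFE(estimator=LogisticRegression(max_iter=1000), n_features_to_select=5)
rfe.fit(X, y)
print(rfe.support_)  # boolean mask of the 5 selected features
```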

First of all, in order to run this algorithm we have to modify the column types, because the categorical variables must be encoded. Therefore, we will move forward analysing which columns could be converted to categorical and which contain symbols that could affect the code in a later step.
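A toy sketch of both fixes (the column names int_rate and term are assumptions based on columns mentioned elsewhere in this notebook): strip symbols such as '%' and replace text categories with numeric codes so the input to RFE is purely numeric.

```python
import pandas as pd

# Toy data with a percentage symbol and a text category (assumed names).
df = pd.DataFrame({"int_rate": ["10.5%", "7.2%"],
                   "term": [" 36 months", " 60 months"]})

# Remove the '%' symbol and convert to float.
df["int_rate"] = df["int_rate"].str.rstrip("%").astype(float)
# Encode the categorical column as integer codes.
df["term"] = df["term"].astype("category").cat.codes
```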

Now, let's check the results of the functions created above.

To move forward with the RFE algorithm we need to drop the NaNs, because it does not run with them, so let's use the dropna function.
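On a toy dataframe the step looks like this:

```python
import numpy as np
import pandas as pd

# RFE cannot handle missing values, so drop every row still containing a NaN.
df = pd.DataFrame({"x": [1.0, np.nan, 3.0], "y": [4.0, 5.0, 6.0]})
df = df.dropna()
```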

As we can observe, we go from 442,567 rows to 407,319.

Feature scaling

Because the RFE algorithm is computationally intensive, and would therefore take too long to produce results, at this stage we will work with a sample of 1,000 random rows.
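A sketch of both steps on toy data: a reproducible random sample followed by standardisation (zero mean, unit variance), which is the usual meaning of feature scaling here.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy dataframe standing in for the cleaned loan data.
df = pd.DataFrame({"x": np.arange(5000, dtype=float),
                   "y": np.arange(5000, dtype=float) * 2})

# Reproducible 1,000-row sample, then standardise every feature.
sample = df.sample(n=1000, random_state=42)
scaled = StandardScaler().fit_transform(sample)
```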

Finally, we can start with the RFE algorithm to select the variables

Once the execution finishes, you can use this line of code to see how many features are optimal to produce the best accuracy (or whatever your chosen metric is):

Not only this, but you can also plot the accuracy obtained with every number of features used:

It is visible that with 29 features the accuracy was close to 80%, which is very good taking into account that we initially had 150 variables.

You can also print out which features are considered to be least important and drop them with this snippet:
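A toy sketch of that snippet: after fitting, support_ marks the selected columns, so the unselected (least important) ones can be listed and dropped.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Synthetic data with named columns standing in for the loan features.
X, y = make_classification(n_samples=200, n_features=8, n_informative=3,
                           random_state=0)
df = pd.DataFrame(X, columns=[f"f{i}" for i in range(8)])

rfe = RFE(estimator=LogisticRegression(max_iter=1000),
          n_features_to_select=4).fit(df, y)

# Columns NOT in support_ are the least important ones -> drop them.
dropped = df.columns[~rfe.support_]
df = df.drop(columns=dropped)
```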

The fitted estimator inside the RFECV instance also exposes a nifty feature_importances_ attribute which is worth checking out.

Okay, okay, don’t scream at me just yet. I know that this doesn’t tell you much. Thankfully, this is pretty easy to visualize. One way is to create a DataFrame object with the attributes as one column and the importances as the other, and then simply sort the DataFrame by importance in descending order. Then you can use the power of plotting libraries such as Matplotlib to draw a bar chart (horizontal is preferred for this scenario) to get a nice visual representation. Here’s the code for doing so:
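A self-contained sketch of that visualization on synthetic data (a RandomForest is used as the base estimator here because it exposes feature_importances_; the notebook's actual estimator may differ):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend for this illustration
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV

# Synthetic stand-in data with named features.
X, y = make_classification(n_samples=200, n_features=8, n_informative=3,
                           random_state=0)
names = [f"f{i}" for i in range(8)]

rfecv = RFECV(RandomForestClassifier(n_estimators=50, random_state=0),
              cv=3).fit(X, y)

# Pair each selected feature with the importance reported by the fitted
# underlying estimator, then sort descending.
imp = (pd.DataFrame({
           "attribute": [n for n, s in zip(names, rfecv.support_) if s],
           "importance": rfecv.estimator_.feature_importances_})
       .sort_values("importance", ascending=False))

plt.barh(imp["attribute"], imp["importance"])  # horizontal bar chart
plt.savefig("importances.png")
```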

At this stage, the most important variable is int_rate (interest rate). This makes sense, because a client with a high interest rate usually represents a high risk. The interest rate depends on the total amount of the loan, the term, the emp_title, emp_length... and so on.

The number of mortgage accounts (mort_acc) is not significant in explaining the loan status. That is a little odd, because a person with multiple mortgage accounts could be a risky customer and a probable defaulter.


Exploratory Data Analysis

In statistics, exploratory data analysis is an approach to analyzing data sets to summarize their main characteristics, often with visual methods. A statistical model can be used or not, but primarily EDA is for seeing what the data can tell us beyond the formal modeling or hypothesis testing task.

Now, we will check the data info.

We can see that some of them are not of the correct type, so we will modify them.

Automatic report

Firstly, we are going to generate an automatic EDA report with Sweetviz:

Correlations

A correlation coefficient is a numerical measure of some type of correlation, meaning a statistical relationship between two variables. The variables may be two columns of a given data set of observations, often called a sample, or two components of a multivariate random variable with a known distribution.

Several types of correlation coefficient exist, each with its own definition, range of usability and characteristics. They all assume values in the range from −1 to +1, where ±1 indicates the strongest possible relationship and 0 indicates no relationship at all.

Heatmap
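A minimal sketch of a correlation heatmap drawn with Matplotlib on toy data (the notebook may use a dedicated plotting library instead; this only illustrates the idea):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend for this illustration
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Toy numeric dataframe standing in for the loan data.
rng = np.random.default_rng(1)
df = pd.DataFrame(rng.normal(size=(100, 4)), columns=list("abcd"))
corr = df.corr()

# Draw the correlation matrix as a colour-coded grid.
fig, ax = plt.subplots()
im = ax.imshow(corr, vmin=-1, vmax=1, cmap="coolwarm")
ax.set_xticks(range(len(corr)))
ax.set_xticklabels(corr.columns)
ax.set_yticks(range(len(corr)))
ax.set_yticklabels(corr.columns)
fig.colorbar(im)
fig.savefig("heatmap.png")
```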

The following variables are highly correlated:

Pairplot

We made a pairplot in order to visualize those correlations, and we can conclude that the relationship is not as clear as we thought initially, so we are not going to drop any of them.

Outliers

Boxplots

In descriptive statistics, a box plot or boxplot is a method for graphically depicting groups of numerical data through their quartiles. Box plots may also have lines extending from the boxes (whiskers) indicating variability outside the upper and lower quartiles, hence the terms box-and-whisker plot and box-and-whisker diagram. Outliers may be plotted as individual points.
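A toy sketch of the kind of boxplot used below (the column names term and funded_amnt are assumptions based on the variables discussed in this notebook):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend for this illustration
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Toy data: funded amount split by loan term (assumed column names).
rng = np.random.default_rng(2)
df = pd.DataFrame({"term": rng.integers(0, 2, size=200),
                   "funded_amnt": rng.normal(12000, 3000, size=200)})

# One array of values per term group, then one box per group.
groups = [g["funded_amnt"].values for _, g in df.groupby("term")]
fig, ax = plt.subplots()
ax.boxplot(groups)
ax.set_xticks([1, 2])
ax.set_xticklabels(["36 months", "60 months"])
ax.set_ylabel("funded_amnt")
fig.savefig("boxplot_term.png")
```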

In the plot above, we can see that we have some outliers when term = 0 (36 months); later we will study them...

In this case, we have some outliers in the fourth category. We can also see that the funded amount is higher in the second one.

We can see that some states have outliers. Apart from that, there are some states with very high amounts, such as the one encoded as 0.

Here we have no outliers.

We have some outliers in this variable, and it is interesting that the mean funded amount does not depend on how many years you have been working: it stays between 10,000 and 15,000.

Outlier detection and dropping

After researching some outlier detection methods, we concluded that we will use the interquartile range (IQR):

The interquartile range (IQR), also called the midspread or middle 50%, or technically H-spread, is a measure of statistical dispersion, being equal to the difference between 75th and 25th percentiles, or between upper and lower quartiles, IQR = Q3 − Q1. In other words, the IQR is the first quartile subtracted from the third quartile; these quartiles can be clearly seen on a box plot on the data. It is a measure of the dispersion similar to standard deviation or variance, but is much more robust against outliers.

As we now have the IQR scores, it's time to get hold of the outliers. The code below will give an output with True and False values: data points marked False are valid, whereas True indicates the presence of an outlier.
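A toy sketch of the IQR rule with the conventional 1.5 × IQR fences (the column name funded_amnt is an assumption used for illustration):

```python
import pandas as pd

# Toy data: one obvious outlier (100) among small values.
df = pd.DataFrame({"funded_amnt": [10, 11, 12, 13, 14, 100]})

Q1 = df.quantile(0.25)
Q3 = df.quantile(0.75)
IQR = Q3 - Q1

# True marks an outlier: below Q1 - 1.5*IQR or above Q3 + 1.5*IQR.
is_outlier = (df < (Q1 - 1.5 * IQR)) | (df > (Q3 + 1.5 * IQR))
df_clean = df[~is_outlier.any(axis=1)]
```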

We have lots of outliers, but we do not know whether they are influential; in the next notebooks we will study how to detect influential outliers and what to do with them.